Method for generating video synopsis through scene understanding and system therefor

ABSTRACT

Embodiments relate to a method for generating a video synopsis including receiving a user query; performing an object based analysis of a source video; and generating a synopsis video in response to a video synopsis generation request from a user, and a system therefor. The video synopsis generated by the embodiments reflects the user's desired interaction.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Korean Patent Application No. 10-2019-0114149, filed on Sep. 17, 2019, and all the benefits accruing therefrom under 35 U.S.C. § 119, the contents of which in its entirety are herein incorporated by reference.

BACKGROUND

1. Field

Embodiments of the present disclosure relate to video synopsis generation, and more particularly, to a method for generating a user's desired video synopsis based on understanding of a scene by determining interaction in the scene and a system therefor.

[Description of National Research and Development Support]

This study was supported by the National Research Foundation of Korea (Project Name: Development of core technologies for complex recognition to optimally analyze and infer identity according to dynamic changes in space-time/view, Project No. 1711094167) under the superintendence of the Ministry of Science, ICT and Future Planning, Republic of Korea.

2. Description of the Related Art

Imaging devices such as closed-circuit television (CCTV) cameras, black boxes, etc., are widely used in modern daily life. Videos captured by the imaging devices are useful in a wide range of applications, especially in security applications such as surveillance and criminal investigation. For example, to efficiently find the movement paths of suspects or missing people, videos captured by multiple imaging devices are used. However, in the case of long videos, monitoring the videos from beginning to end is inconvenient.

To overcome the inconvenience, Korean Patent Publication No. 10-2008-0082963 discloses a method that extracts moving objects included in a video to generate a synopsis video for the moving objects. The synopsis video makes it possible to see the moving objects in brief within a short time without watching the full video.

However, the above related art generates a video synopsis based on activities of the objects (for example, velocity, speed, and direction of the objects) included in the video. Thus, there is a high possibility that a video synopsis including many objects unnecessary to a user will be generated, and generating a customized video synopsis is a challenge.

On the other hand, there is an attempt to generate a video synopsis based on not only the activities of the objects included in the video but also the appearance (for example, color, size, etc.) of the objects (Korean Patent Publication No. 10-2009-0117771). As this is further based on features of the specific objects for which a user desires to search, the generated video synopsis includes fewer objects unnecessary to the user.

However, a limitation of this attempt is that the video synopsis is generated using only features of the objects themselves, such as activity and appearance.

SUMMARY

According to an aspect of the present disclosure, there is provided a system for generating a video synopsis more suitable for a user's need based on understanding of a scene in a video including a specific object by determining interaction between the specific object and the background or interaction between the specific object and another object.

There is further provided a method for generating a video synopsis using the scene understanding information and a computer-readable recording medium having the method recorded thereon.

A system for generating a video synopsis according to an aspect of the present disclosure may include a source object detection unit configured to detect at least one source object in a source video including at least one object, a motion detection unit configured to detect a motion of the source object in the source video, a tube generation unit configured to generate a source object tube including the source object on which the motion is detected, a scene understanding unit configured to determine interaction associated with the source object of the tube, and a video synopsis generation unit configured to generate a video synopsis based on the source object tube associated with the determined interaction.

In an embodiment, the source object detection unit may detect the source object through an object detection model, and the object detection model may be pre-learned to extract a feature for detecting an object from an input image and determine a class corresponding to the object included in the input image.

In an embodiment, the object detection model may be configured to set a proposed region where the object is located in the input image, and determine the class of the object in the region by extracting the feature from the set image.

In an embodiment, the object detection model may be configured to detect the source object by extracting the feature for detecting the object from the input image and determining the class to which each pixel belongs.

In an embodiment, the object detection model may include a first submodel to determine a region of interest (ROI) by detecting a position of the object in the input image, and a second submodel to mask the object included in the ROI.

In an embodiment, the source object detection unit may be further configured to extract a background from the source video.

In an embodiment, the source object detection unit may be configured to extract the background through a background detection model that determines at least one class regarded as the background for each pixel.

In an embodiment, the source object detection unit may extract the background by cutting a region occupied by the source object in the source video.

In an embodiment, the system may further include a background database (DB) to store the extracted background of the source video.

In an embodiment, the motion detection unit may be configured to compute tracking information of the source object by tracking a specific source object in a subset of frames in which the specific source object is detected.

In an embodiment, the tracking information of the source object may include at least one of whether the object is moving or not, a velocity, a speed, and a direction.

In an embodiment, the tube generation unit may be configured to generate the source object tube based on a subset of frames representing an activity of the source object, or a combination of the subset of frames representing the activity of the source object and a subset of frames representing the source object.

In an embodiment, the tube generation unit may be further configured to filter a region including the source object in the frame of the source video, and generate the source object tube in which at least part of the background of the source video is removed.

In an embodiment, when an image region including the source object is extracted as a result of the detection of the source object, the tube generation unit may be further configured to generate the source object tube using the extracted image region instead of the filtering.

In an embodiment, the scene understanding unit may be configured to determine the interaction associated with the source object of the tube through an interaction determination model pre-learned to determine interaction associated with an object included in an input image. Here, the interaction determination model includes a convolution network.

In an embodiment, the interaction determination model may be learned such that an associable interaction class is preset for the object that is a subject of an action triggering the interaction.

In an embodiment, the interaction determination model may be further configured to receive an image having a size including a first object as the input image, extract a first feature, receive an image having a size including a region including the first object and a different region as the input image, extract a second feature, and determine interaction associated with the first object based on the first feature and the second feature.

In an embodiment, the different region may be a region including a second object that is different from the first object, or a background.

In an embodiment, the first feature may be a feature extracted to detect the source object.

In an embodiment, the interaction determination model may be configured to determine a class of the interaction by detecting an activity of a specific object which is a subject of an action triggering the interaction and detecting a different element associated with the interaction.

In an embodiment, the interaction determination model may include an activity detection network to detect the activity of the specific object in the input image, and an object detection network to detect an object that is different from the activity object by extracting a feature from the input image. Additionally, the activity detection network may be configured to extract the feature for determining a class of the activity appearing in the video, and the object detection network may be configured to extract the feature for determining a class of the object appearing in the video.

In an embodiment, the feature for determining the class of the activity may include a pose feature, and the feature for determining the class of the object may include an appearance feature.

In an embodiment, the interaction determination model may be further configured to link a set of values computed by the activity detection network and a set of values computed by the object detection network to generate an interaction matrix, and determine the interaction associated with the specific object in the input image based on the activity and the object corresponding to a row and a column of an element having a highest value among elements of the interaction matrix.

In an embodiment, the tube generation unit may be further configured to label the source object tube with at least one of source object related information or source video related information as a result of the detection of the source object, and interaction determined to be associated with the source object of the tube.

In an embodiment, the system may further include a source DB to store at least one of the source object tube and the labeled data.

In an embodiment, the video synopsis generation unit may be configured to generate the video synopsis in response to a user query including a synopsis object and a synopsis interaction required for the synopsis.

In an embodiment, the video synopsis generation unit may be further configured to determine the source object corresponding to the synopsis object among the detected source objects, and select the source object for generating the video synopsis by filtering, from among the determined source objects, the source object associated with the interaction corresponding to the synopsis interaction.

In an embodiment, the video synopsis generation unit may be further configured to determine a start time of the selected synopsis object tubes with minimized collision between the selected synopsis object tubes.

In an embodiment, the video synopsis generation unit may be further configured to group tubes having a same synopsis interaction to generate a group for each interaction when the user query includes multiple synopsis interactions, and determine a start time of each group for arrangement with minimized collision between the groups.

In an embodiment, the video synopsis generation unit may be further configured to determine the start time of the selected synopsis object tubes in a same group based on a shoot time in the source video.

In an embodiment, the video synopsis generation unit may be configured to generate a synopsis video based on arranged synopsis object tubes and a background.

In an embodiment, the video synopsis generation unit may be further configured to stitch the source object tubes in which at least part of the background of the source video is removed with the background of the source video.

A method for generating a video synopsis using a tube of a source object detected in a source video and interaction, performed by a computing device including a processor, according to another aspect of the present disclosure may include receiving a user query including a synopsis object and a synopsis interaction required for the synopsis, acquiring a source object and an interaction corresponding to the user query, selecting the source object associated with the interaction corresponding to the user query as a tube for synopsis, arranging the selected tubes, and generating a video synopsis based on the selected tubes and a background.

In an embodiment, the source object is detected in the source video through an object detection model, and the object detection model is pre-learned to extract a feature for detecting an object in an input image and determine a class corresponding to the object included in the input image.

In an embodiment, the background includes a background of the source video extracted by detecting the object in the source video.

In an embodiment, the tube of the source object may be generated based on a subset of frames representing an activity of the source object, or a combination of the subset of frames representing the activity and a subset of frames representing the source object.

In an embodiment, the frame representing the activity may be determined based on tracking information including at least one of whether the object is moving or not, a velocity, a speed, and a direction.

In an embodiment, the tube of the source object may be generated by filtering a region including the source object in the frame of the source video.

In an embodiment, when an image region including the source object is extracted as a result of the detection of the source object, the tube of the source object may be generated using the extracted image region instead of the filtering.

In an embodiment, the interaction associated with the source object of the tube may be determined through an interaction determination model pre-learned to determine interaction associated with an object included in an input image.

In an embodiment, the interaction determination model may be configured to receive an image having a size including a first object as the input image, extract a first feature, receive an image having a size including a region including the first object and a different region as the input image, extract a second feature, and determine interaction associated with the first object based on the first feature and the second feature.

In an embodiment, the interaction determination model may include an activity detection network to detect an activity of the specific object in the input image, and an object detection network to detect an object that is different from the activity object by extracting a feature from the input image, the activity detection network may be configured to extract the feature for determining a class of the activity appearing in the video, and the object detection network may be configured to extract the feature for determining a class of the object appearing in the video.

In an embodiment, selecting the source object associated with the interaction corresponding to the user query may include determining the source object corresponding to the synopsis object among the detected source objects, and selecting the source object for generating the video synopsis by filtering, from among the determined source objects, the source object associated with the interaction corresponding to the synopsis interaction.

In an embodiment, arranging the selected tubes may include determining a start time of the selected synopsis object tubes with minimized collision between the selected synopsis object tubes.

In an embodiment, arranging the selected tubes may include grouping tubes having a same synopsis interaction to generate a group for each interaction when the user query includes multiple synopsis interactions, and determining a start time of each group for arrangement with minimized collision between the groups.

A computer-readable recording medium according to still another aspect of the present disclosure may have stored thereon computer-readable program instructions that run on a computer. Here, when the program instructions are executed by a processor of the computer, the processor performs the method for generating a video synopsis according to the embodiments.

The video synopsis system according to an aspect of the present disclosure may generate a video synopsis reflecting interaction by selecting tubes for video synopsis based on a source object detected in a source video, motion of the source object, and interaction associated with the source object.

As above, the video synopsis is generated based on not only low-level video analysis involving analyzing the presence or absence of an object in a video and the appearance of the object, but also high-level video analysis involving analyzing interaction between a target object and another element (for example, a different object or a background) located outside of the target object.

Accordingly, the video synopsis system may generate a better customized video synopsis with minimized unnecessary information other than the target object that the user wants to see.

As a result, the video synopsis system maximizes user convenience without departing from the user's need, and minimizes the capacity of the video synopsis.

The effects of the present disclosure are not limited to the above-mentioned effects, and other effects not mentioned herein will be clearly understood by those skilled in the art from the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The following is a brief introduction to the drawings necessary for the description of the embodiments to describe the technical solutions of the embodiments of the present disclosure or the existing technology more clearly. It should be understood that the accompanying drawings are for the purpose of describing the embodiments of the present disclosure and not intended to be limiting of the present disclosure. Additionally, for clarity of description, the accompanying drawings may show some modified elements such as exaggerated and omitted elements.

FIG. 1 is a schematic block diagram of a system for generating a video synopsis according to an embodiment of the present disclosure.

FIG. 2 is a diagram illustrating a fully-supervised interaction determination model according to an embodiment of the present disclosure.

FIG. 3 is a diagram illustrating a semi-supervised interaction determination model according to an embodiment of the present disclosure.

FIG. 4 is a diagram illustrating a process of training the interaction determination model of FIG. 3.

FIG. 5 is a conceptual diagram of a video synopsis generated by a system for generating a video synopsis according to an embodiment of the present disclosure.

FIG. 6 is a flowchart of a method for generating a video synopsis according to an embodiment of the present disclosure.

FIG. 7 is a flowchart illustrating a source video analysis process according to an embodiment of the present disclosure.

FIG. 8 is a diagram illustrating a synopsis video generation process according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The terminology used herein is for the purpose of describing particular embodiments and not intended to be limiting of the present disclosure. Unless the context clearly indicates otherwise, the singular forms as used herein include the plural forms as well. The term “comprises” or “includes” when used in this specification, specifies the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, regions, integers, steps, operations, elements and/or components.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by those skilled in the art. It is further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art document and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Hereinafter, the embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

A system and method for generating a video synopsis according to embodiments of the present disclosure may understand a scene by analysis of interaction between a first object and a background in a source video or interaction between the first object and a second object that is different from the first object, and generate a video synopsis based on the scene understanding. Accordingly, in response to receiving a video synopsis generation request from a user, it is possible to generate a video synopsis taking the user's need into further consideration, compared to earlier approaches for generating a video synopsis using features descriptive of the object itself in the source video, such as activity and appearance of the object.

In the specification, the source video includes an object (a source object) in at least some of its frames, and the object is represented in the spatio-temporal domain. The concatenation of images representing the object or activity across successive frames of a video is referred to as a “tube”. As the object is represented by the tube in the spatio-temporal volume, the terms “object” and “tube” are used interchangeably in the following description.

FIG. 1 is a schematic block diagram of a system for generating a video synopsis according to an embodiment of the present disclosure.

Referring to FIG. 1, the system 1 for generating a video synopsis includes a video analysis unit 10 to analyze a source video, and a video synopsis generation unit 50 to generate a synopsis video in response to a video synopsis generation request from a user. Additionally, the system 1 for generating a video synopsis may further include an imaging device (not shown) to capture the source video.

The system 1 for generating a video synopsis according to embodiments may have aspects of entirely hardware, entirely software, or partly hardware and partly software. For example, the system may refer collectively to hardware capable of processing data and software that manages the hardware. The terms “unit”, “system” and “device” as used herein are intended to refer to a combination of hardware and software that runs by the corresponding hardware. For example, the hardware may be a data processing device including a Central Processing Unit (CPU), a Graphic Processing Unit (GPU) or other processor. Additionally, the software may refer to a process being executed, an object, an executable, a thread of execution and a program.

Additionally, the system 1 for generating a video synopsis may include at least one database (DB). In an embodiment, the system 1 for generating a video synopsis includes a source DB 20, a background DB 30 and a tube DB 40.

Each DB 20, 30, 40 refers to a large set of structured, unstructured or semi-structured data, and/or hardware that stores the data. The structured data is data stored in a fixed field, and includes, for example, a relational database and a spreadsheet. Additionally, the unstructured data is data that is not stored in a fixed field, and includes, for example, text documents, images, video and audio data. Additionally, the semi-structured data is data that is not stored in a fixed field, but includes metadata or schema, and the semi-structured data includes, for example, XML, HTML, and text.

The details such as the information stored in each DB 20, 30, 40 are described below.

The video analysis unit 10 is configured to process the source video. In an embodiment, the video analysis unit 10 is configured to perform an object based video analysis.

The source video is a video captured by the imaging device (not shown). In an embodiment, the video may be an endless video including a substantially infinite and unbounded number of objects.

The imaging device is a device that images an object to generate an image made up of a single frame and a video made up of successive frames, and includes a CCTV, a smartphone, a black box and a camera, but is not limited thereto.

In some embodiments, the imaging device may further include a component such as an infrared camera to assist in imaging. In this case, the source video includes multiple videos representing the same view and time in different styles.

In an embodiment, the source video captured by the imaging device is stored in the source DB 20. The source video may further include data of multiple frames as well as the shoot time of each frame, identification information of each frame and information of the imaging device (for example, including the model of the imaging device (for example, the manufacturer's reference number), an identifier (for example, ID), etc.).

The video analysis unit 10 acquires the source video from the imaging device or the source DB 20. The imaging device and/or the source DB 20 may be located remotely from the video analysis unit 10. In this case, the video analysis unit 10 includes a component for acquiring the source video. As an example, the video analysis unit 10 may acquire the source video via wired/wireless electrical connection (for example, by a transmitter/receiver or a wired port).

The video analysis unit 10 is configured to perform an operation of an object based analysis for the source video. To this end, in an embodiment, the video analysis unit 10 may include a source object detection unit 11; a motion detection unit 13; a tube generation unit 15; and a scene understanding unit 17.

The source object detection unit 11 is configured to detect the object included in the source video. The object detection is the operation of determining where an object corresponding to a specific class is located in a given video (if any). That is, the source object may be detected by determining the class and position of the object appearing in the source video (or frame). When no class is determined, it is determined that there is no object. Meanwhile, the object may be identified by the corresponding class determined through the detection process.

In an embodiment, the source object detection unit 11 may include an object detection model pre-learned to perform the detection operation. The parameters of the object detection model are determined by machine learning to determine classes by extracting features from a specific region using multiple training samples. The source object detection unit 11 may detect the source object in the source video through the object detection model.

In an embodiment, before determining the class of the object in the video, the source object detection unit 11 may be configured to set a proposed region where the object is located in the video, and analyze the set image. The set region may be referred to as a region of interest (ROI) candidate box, and it is a sort of object localization and corresponds to an initial object detection operation.

In the above embodiment, the set candidate region is applied to the object detection model, and the class of the candidate region is determined by extracting the features from the candidate region.

The method of presetting the candidate region includes, for example, sliding window, selective search and region proposal, but is not limited thereto.

The object detection model may be, for example, a model based on the CNN algorithm, including R-CNN, Fast R-CNN, Faster R-CNN, and YOLO (You Only Look Once), but is not limited thereto.
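
As an illustrative sketch (not the claimed implementation), such a region-proposal-based detector can be applied to a single source frame; the pre-trained torchvision Faster R-CNN model and the score threshold below are assumptions introduced only for this example.

```python
# Illustrative sketch: detecting source objects in one frame with a
# pre-trained Faster R-CNN (region proposal + per-region classification).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_source_objects(frame_rgb, score_threshold=0.7):
    """frame_rgb: HxWx3 uint8 array. Returns a list of (box, class_id, score)."""
    with torch.no_grad():
        prediction = detector([to_tensor(frame_rgb)])[0]
    detections = []
    for box, label, score in zip(prediction["boxes"], prediction["labels"],
                                 prediction["scores"]):
        if score >= score_threshold:      # keep only confident candidate regions
            detections.append((box.tolist(), int(label), float(score)))
    return detections
```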

In an embodiment, the source object detection unit 11 may detect the source object through segmentation. Here, the segmentation involves segmenting a given video at a pixel level to determine where an object corresponding to a specific class is located in the video (if any). That is, the class and position of the object may be determined by determining the class of the object for each pixel on the source video (or frame), and finally, the source object may be detected.

The source object detection unit 11 detects the source object, for example, through the segmentation technique including semantic segmentation and instance segmentation.

The semantic segmentation technique is the technique of determining the class to which each pixel belongs. Each pixel is labeled with data representing the class to which it belongs, and the results of the semantic segmentation may produce a segmentation map showing the determined class of each pixel. The semantic segmentation technique does not distinguish between multiple objects determined as the same class.

The source object may also be detected through the instance segmentation technique, which is performed in a similar way to the semantic segmentation technique. However, as opposed to the semantic segmentation technique, the instance segmentation technique distinguishes between multiple objects determined as the same class. That is, when the instance segmentation technique is used, segmentation is performed while object recognition is performed.

The object detection model that detects the source object through segmentation may be, for example, a model based on the CNN algorithm, including Fully Convolutional Network (FCN), DeepLab, U-Net and ReSeg, but is not limited thereto.

In some embodiments, the source object detection unit 11 may perform object detection and segmentation at the same time. To this end, the source object detection unit 11 includes the object detection model pre-learned to perform object detection and segmentation at the same time.

As an example, the object detection model is configured to determine an ROI through a submodel (for example, Faster R-CNN) that acts as the existing object detector, and perform instance segmentation through a submodel (for example, FCN) that masks an object included in each ROI (i.e., mask segmentation). Here, the masking refers to setting a region where an object is located in a video, and includes fewer non-object regions than the above-described ROI candidate box.

The object detection model that performs object detection and segmentation at the same time may be, for example, a model based on the Mask R-CNN algorithm, but is not limited thereto.
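
A minimal sketch of this simultaneous detection-and-masking case, assuming the pre-trained torchvision Mask R-CNN model (the thresholds are illustrative); the per-instance masks produced here can also serve later as the object regions to cut when extracting the background:

```python
# Illustrative sketch: simultaneous object detection and instance masking with
# a pre-trained Mask R-CNN (ROI detection submodel + mask submodel).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

mask_model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
mask_model.eval()

def detect_and_mask(frame_rgb, score_threshold=0.7, mask_threshold=0.5):
    """Returns (box, class_id, binary_mask) triples for one HxWx3 uint8 frame."""
    with torch.no_grad():
        pred = mask_model([to_tensor(frame_rgb)])[0]
    results = []
    for box, label, score, mask in zip(pred["boxes"], pred["labels"],
                                       pred["scores"], pred["masks"]):
        if score >= score_threshold:
            binary_mask = (mask[0] >= mask_threshold).numpy()   # HxW boolean
            results.append((box.tolist(), int(label), binary_mask))
    return results
```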

The classes for object detection rely on a training set including multiple training samples each having multiple videos. When machine learning is completed through the training samples, the object detection model has the classes corresponding to the training samples as the classes for object detection. Additionally, classes may be added by re-training through multiple training videos.

The training set includes, for example, the Pascal VOC dataset including 20 classes, or the COCO dataset including 80 classes, but is not limited thereto.

As above, the source object detection unit 11 may detect the source object, and acquire information of the source object (source object information). The source object information includes, for example, the position of the object, the boundary between the object and the background and the class of the object. The position of the object is information about where the source object is located in the source frame. The boundary is a boundary between the region occupied by the source object in the source frame and the remaining region; the region does not refer to the shape of the actual source object, and is determined based on information acquired in the object detection process. For example, when the source object is detected through the ROI candidate box, the ROI candidate box may be determined as the boundary between the object and the background. As another example, when the source object is detected through segmentation, the boundary between the object and the background may be determined according to the segmentation region.

Additionally, the source object detection unit 11 may extract the background from the source video.

In an embodiment, the source object detection unit 11 may extract the background from the source video through a variety of background subtraction algorithms. The background subtraction algorithm may include, for example, Mixture of Gaussian (MoG), improved Mixture of Gaussian (MoG), ViBe and Graph-cut, but is not limited thereto.

For example, the source object detection unit 11 may extract the background by applying the improved Mixture of Gaussian (MoG) using the Gaussian mixture model (GMM) to the source video. The source object detection unit 11 samples background frames for a predetermined period of time through the improved MoG using GMM to generate a background model, and computes changes in the images of the frames each time. Since the source object detection unit 11 extracts the background by detecting changes in the full video caused by changes in the lighting component over the predetermined period of time, background extraction errors caused by changes in lighting are reduced. As a result, it is possible to perform a high quality background extraction function with a high extraction rate.
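
A minimal sketch of this Gaussian-mixture-based background extraction using OpenCV's MOG2 subtractor; the history length and variance threshold are assumed values, not parameters taken from the disclosure:

```python
# Illustrative sketch: background extraction with a Gaussian mixture model
# (MOG2), updated frame by frame over a sampling period.
import cv2

def extract_background(video_path, sample_frames=500):
    subtractor = cv2.createBackgroundSubtractorMOG2(history=sample_frames,
                                                    varThreshold=16,
                                                    detectShadows=True)
    capture = cv2.VideoCapture(video_path)
    for _ in range(sample_frames):
        ok, frame = capture.read()
        if not ok:
            break
        subtractor.apply(frame)              # update the per-pixel mixture model
    capture.release()
    return subtractor.getBackgroundImage()   # current background estimate
```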

In another embodiment, the source object detection unit 11 may extract the background by detecting the background through segmentation.

For example, the source object detection unit 11 may extract the background through a background detection model that determines at least one class regarded as the background for each pixel. The class of the background detection model may include, for example, a road, a sidewalk, a wall, sky, a fence and a building, but is not limited thereto.

The background detection process through segmentation is similar to the above-described object detection process through segmentation, and its detailed description is omitted herein.

In another embodiment, the background is acquired by removing the region occupied by the source object in the source video.

When the source object detection unit 11 sets the region where the source object is located (for example, the region occupied by the source object in the frame) to detect the source object, the source object detection unit 11 may extract the background by cutting the region set in the source object detection process.

For example, when the source object is detected through segmentation (such as, for example, instance segmentation), the segmentation results (for example, the masking results) are determined as the region occupied by the source object in the source video, and the corresponding region is cut.

Alternatively, when the source object is detected by setting the ROI candidate box, the ROI candidate box is determined as the region occupied by the source object in the source video, and the corresponding region is cut.

In some embodiments, the source object detection unit 11 may be further configured to additionally perform the segmentation operation after the detection operation for more accurate background extraction.

For example, the source object detection unit 11 may determine a region that is smaller than the ROI candidate box as the region occupied by the source object in the source video by segmenting the ROI candidate box. The segmentation operation is similar to the object detection operation through segmentation, and its detailed description is omitted herein. Through this cutting operation, the source object detection unit 11 acquires multiple frames excluding the source object region. In an embodiment, the source object detection unit 11 may acquire a background in which at least part of the excluded region is filled, based on the multiple frames excluding the source object region.
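
One possible realization of filling the excluded region from multiple frames, sketched under the assumption that per-frame object masks are already available from the detection or segmentation step, is a per-pixel median over the frames in which the pixel is not covered by any source object:

```python
# Illustrative sketch: fill the background by taking, for every pixel, the
# median colour over the frames in which no source object covers that pixel.
import numpy as np

def fill_background(frames, object_masks):
    """frames: list of HxWx3 uint8 arrays; object_masks: list of HxW booleans
    (True where a source object was detected and cut out)."""
    stack = np.stack(frames).astype(np.float32)          # N x H x W x 3
    masks = np.stack(object_masks)[..., None]            # N x H x W x 1
    stack[np.broadcast_to(masks, stack.shape)] = np.nan  # hide object pixels
    background = np.nanmedian(stack, axis=0)             # per-pixel median of the rest
    # pixels occluded in every frame fall back to the plain median over all frames
    fallback = np.median(np.stack(frames), axis=0)
    return np.where(np.isnan(background), fallback, background).astype(np.uint8)
```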

In some embodiments, the extracted background may be stored in the background DB 30.

The motion detection unit 13 detects the motion of the object in the video. In an embodiment, the motion detection unit 13 is configured to track the specific source object in a subset of frames in which the specific source object is detected. Accordingly, the motion detection unit 13 may compute tracking information as a result of the tracking.

The subset of frames representing the source object, selected based on each object in the source video, will be used for the video synopsis. The source object detected by the source object detection unit 11 may include a stationary object. This result occurs when the object detection model is configured to classify the class corresponding to the stationary object.

In general, the object in which the user gets interested in a video synopsis is a moving object. When the source video is an endless video, a useful video synopsis may be generated by generating successive sequences for the same object (for example, having the same ID) among the objects detected in multiple frames.

The source object is detected in a series of frames. The motion detection unit 13 may track the detected source object by connecting the detected source object in the series of frames based on changes in time and color differences between the frames.

The tracking results are represented as a sequence of the corresponding frames including the source object. When the trajectory of the tracking results (for example, location, direction, etc.) is analyzed, the source object's activity is generated. The activity of the source object in each frame is eventually represented in the sequence of the corresponding frames including the source object mask. Here, the source object mask is preset (for example, by the source object detection unit 11) in the segmentation or candidate region setting process.

The motion detection unit 13 computes the tracking information of the source object based on the series of frames. As an example, when a color difference between locations of at least part of the source object mask in each frame is greater than a preset value, the source object is classified as a moving object, and the tracking information is computed based on changes in time and the color difference between the frames.

However, the present disclosure is not limited thereto, and a variety of algorithms suitable for detecting the motion of the object in the video may be used. For example, an algorithm for tracking the object in the video may be used.

In some embodiments, the tracking information includes whether the object is moving or not, velocity, speed and direction, but is not limited thereto. Here, the moving object includes an object that moves in a specific frame subset in a series of frames, but does not move in a different frame subset.
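
As a simple sketch of how such tracking information might be derived from one tracked source object's bounding boxes (the frame rate and the motion threshold are assumed values):

```python
# Illustrative sketch: deriving tracking information (moving or not, velocity,
# speed, direction) from the centroids of one tracked source object.
import math

def compute_tracking_info(boxes, fps=30.0, motion_threshold_px=1.0):
    """boxes: (x1, y1, x2, y2) of the source object in consecutive frames."""
    centroids = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    dt = 1.0 / fps
    velocities = [((bx - ax) / dt, (by - ay) / dt)
                  for (ax, ay), (bx, by) in zip(centroids, centroids[1:])]
    speeds = [math.hypot(vx, vy) for vx, vy in velocities]
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    # direction of the overall displacement between the first and last frame
    dx = centroids[-1][0] - centroids[0][0]
    dy = centroids[-1][1] - centroids[0][1]
    return {
        "is_moving": mean_speed * dt > motion_threshold_px,  # pixels per frame
        "velocity": velocities,
        "speed": mean_speed,
        "direction_deg": math.degrees(math.atan2(dy, dx)),
    }
```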

The tube generation unit 15 generates a source object tube including the source object on which the motion is detected. In an embodiment, the source object tube is generated based on the source object detection results and/or the tracking information. As described above, the tube is a series of motion-connected source object or object sets, and refers to the concatenation of sequences representing the object or activity across the frames of the source video.

In an embodiment, the tube generation unit 15 generates the tube of the moving source object. The tube of the moving source object is generated based on a subset of frames representing the activity, or a combination of the subset of frames representing the activity and a subset of frames representing the source object. For example, when the source object stays still for a first time and moves for a second time, the tube including a subset of frames of the first time and a subset of frames of the second time may be generated.

In an embodiment, the tube may be generated by filtering the region including the source object in the source frame. The video substantially required to generate a video synopsis is the video of the source object. The tube generation unit 15 may generate the tube including the video of the region occupied by the source object.

As an example, the tube generation unit 15 may generate the tube made up of only the bounding box including the source object.

As another example, the tube generation unit 15 may segment the source object and remove the background to generate the source object tube.

In some embodiments, the filtering operation may be performed based on the information acquired in the source object detection process (for example, by the source object detection unit 11).

As an example, when the source object is detected by setting an interest box region, the interest box region may be used as the bounding box. As another example, when the source object is detected through segmentation, the source object tube may be generated by using the segmentation of the detection process and removing the background.

Alternatively, the tube generation unit 15 may filter the video of the region including the source object based on the position of the source object acquired in the source object detection process or the boundary between the object and the background.

In some embodiments, the tube generation unit 15 may perform the filtering operation using the source object detection results.

By the filtering operation, the source object tube in which at least part of the background is removed is generated.

The tube generation unit 15 is further configured to label the tube with detailed information about the tube (hereinafter, “tube information”). The tube information includes, for example, source object related information (for example, class information, position information, shoot time information, etc.), source video related information (for example, a source video identifier, total playback time, etc.), playback range information of the corresponding tube in the source video, and playback time information of the corresponding tube.

In some embodiments, the source object tube and/or the labeled tube information may be stored in the tube DB 40. The tube DB 40 provides the tubes used to generate a video synopsis.
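
A sketch of one way the source object tube and its labeled tube information could be represented before being stored in the tube DB 40; the field names are illustrative and not taken from the disclosure:

```python
# Illustrative sketch: a source object tube together with its labeled
# tube information, as it might be stored in the tube DB.
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np

@dataclass
class SourceObjectTube:
    object_id: int                     # identity of the tracked source object
    class_name: str                    # source object related information, e.g. "person"
    source_video_id: str               # source video related information
    frame_indices: List[int]           # frames of the source video covered by the tube
    shoot_times: List[float]           # shoot time of each frame (seconds)
    patches: List[np.ndarray]          # filtered regions containing the source object
    masks: List[np.ndarray]            # per-frame masks (background at least partly removed)
    interaction: Optional[str] = None  # interaction label added by scene understanding
    extra: dict = field(default_factory=dict)

    @property
    def playback_range(self):
        """Playback range of this tube within the source video."""
        return self.frame_indices[0], self.frame_indices[-1]
```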

For higher level scene analysis of the source video, it is necessary to analyze the components of the scene, the object and the background, as well as a relationship between the components. In the relationship between the components, a relationship required to understand the scene is referred to as interaction.

The interaction is given when the component does an action. Here, the action refers to an action by the object, and a change in the background over time is not regarded as the action. Accordingly, the interaction analyzed by the system 1 for generating a video synopsis includes interaction between the object and the background or interaction between objects. That is, the interaction is associated with at least one moving object.

The scene understanding unit 17 is configured to determine interaction associated with the source object of the tube. By determining the interaction associated with the individual object, the system 1 understands the scene representing the action of the object. A target for scene understanding includes the source video including the source object of the tube.

A typical example of the moving object is a human. As the action of the human inevitably takes place in an endless video, the human has interaction with the background or a different object (for example, an article or a different human). Hereinafter, for clarity of description, the present disclosure will be described in more detail through embodiments in which at least one of the objects having interaction is a human. That is, hereinafter, it is determined that at least one of the source objects having interaction corresponds to a human class. The action of the human is understood as a verb.

However, the embodiments of the present disclosure are not limited thereto, and it will be obvious to those skilled in the art that interaction is also determined in embodiments in which the moving object is a moving non-human (for example, interaction between an article and the background or interaction between articles).

In an embodiment, the scene understanding unit 17 may determine the interaction associated with the source object of the tube through an interaction determination model. The interaction determination model is pre-learned to determine the interaction associated with the object included in the input image.

In an embodiment, the interaction determination model may include a variety of CNNs including Instance-Centric Attention Network (ICAN), BAR-CNN, InteractNet and HO-RCNN. The interaction determination model is machine-learned by the fully-supervised technique or the semi-supervised technique.

The fully-supervised technique is the technique that recognizes the object corresponding to the original rule in the input image. The original rule presets the interaction determined for each object.

The parameters of the network (or layer) of the interaction determination model that characterizes the relationship between the human and the object are learned through the fully-supervised technique. The interaction determination model learned through the fully-supervised technique may compute interaction scores between videos each representing two objects (for example, masked by the bounding box).

FIG. 2 is a diagram illustrating the fully-supervised interaction determination model according to an embodiment of the present disclosure.

The interaction determination model 200 of FIG. 2 determines interaction by the fully-supervised technique, and the interaction determination model 200 includes: an input layer 210 to transmit an input image including at least one source object for scene understanding to a neural network 230; the neural network 230 including a plurality of neurons configured to receive the video from the input layer 210 and extract features for determining interaction in the corresponding video; and a determination layer 250 to determine the interaction associated with the object.

The input layer 210 of FIG. 2 receives the input image and transmits the input image to the neural network 230. In an embodiment, the input layer 210 may be configured to localize a region in which features are to be extracted. For the localization, the input layer 210 may resize the input image.

In some embodiments, the input layer 210 may be configured to crop the object in the input image, or resize the input image or the cropped video to a preset size.

The localization process of the input layer 210 is similar to the ROI candidate box setting process of the object detection model, and its detailed description is omitted herein.

In some other embodiments, the localization operation of the input layer 210 is performed using the information of the source object detection process (for example, by the source object detection unit 11). The information of the detection process includes, for example, the position of the source object, and/or the boundary between the object and the background.

In some embodiments, the determination layer 250 may include a fully connected layer in which all input nodes are connected to all output nodes.

The interaction determination model 200 of FIG. 2 is a machine learning model of CNN structure having a multi-stream architecture. The multi-stream includes an object stream and a pairwise stream included in the input image. The object stream refers to a flow of data processing of the video of the corresponding object through the input layer 210, the neural network 230 and the determination layer 250. The pairwise stream refers to a flow of data processing of each video (for example, a human video and an object video) including each object in the pair through the input layer 210, the neural network 230 and the determination layer 250. The resized video in the object stream has a size including the corresponding object, but the resized video in the pairwise stream is configured to have a size including the two objects in the pair. In the case of interaction between the background and the object, the pairwise stream is configured to have a size including both the object and the background.

For example, as shown in FIG. 2, when the full input image is a video captured at the moment when a human rides a bicycle, the streams include a human stream, a bicycle stream and a human-bicycle pair stream. In the human stream, the localized video for feature extraction has a size including the human, but in the human-bicycle pair stream, the localized video for feature extraction has a size including both the human and the bicycle.

The interaction determination model of FIG. 2 determines interaction based on the features extracted from each stream. Human related features are extracted from the human stream, bicycle related features are extracted from the bicycle stream, and interaction related features are extracted from the pair stream.

The determination layer 250 is pre-learned to generate class scores representing interaction classes to which the input image belongs based on the extracted features, and the learned determination layer may generate the class scores for interactions of the input image.

Assume that the interaction determination model of FIG. 2 is designed to determine only interaction for “biking” (i.e., determine whether it corresponds to a single class). The determination layer 250, which is the final layer of the interaction determination model, is a binary classifier that computes a score for “biking” (for example, a probability value of belonging to the class).

In an embodiment, the determination layer 250 is further configured to merge the class scores computed through the fully connected layer in each stream. Accordingly, the interaction analysis results for each stream (i.e., the class scores for each stream) are merged to acquire a final score for the original input image.

In some embodiments, the determination layer 250 may acquire the final score by summing the class scores for each stream. In this case, the interaction determination model of FIG. 2 is learned to determine the corresponding class (i.e., interaction) based on the final score computed by the summing.
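
A condensed sketch of this multi-stream scoring for a single interaction class such as “biking”, assuming a shared backbone CNN that maps an image crop to a feature vector; the class and parameter names are placeholders:

```python
# Illustrative sketch: fully-supervised multi-stream scoring for one
# interaction class ("biking"). Each stream (human crop, object crop,
# human-object pair crop) is scored separately and the per-stream class
# scores are summed into the final score.
import torch
import torch.nn as nn

class MultiStreamInteractionScorer(nn.Module):
    def __init__(self, backbone: nn.Module, feature_dim: int):
        super().__init__()
        self.backbone = backbone                     # maps a crop to a feature vector
        self.classifier = nn.Linear(feature_dim, 1)  # binary score for "biking"

    def forward(self, human_crop, object_crop, pair_crop):
        stream_scores = [self.classifier(self.backbone(crop))
                         for crop in (human_crop, object_crop, pair_crop)]
        final_score = sum(stream_scores)             # merge by summing per-stream scores
        return torch.sigmoid(final_score)            # probability of the "biking" class
```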

As above, the interaction determination model of FIG. 2 is learned such that an associable interaction with the object (for example, the human) is preset. That is, the interaction relies on the type of the activity object. To this end, the interaction determination model of FIG. 2 is pre-learned using a training set including training videos representing the associable interaction.

To avoid limiting the activity object to a human, when the model is learned such that an associable interaction with a non-human object is preset, the interaction determination model of FIG. 2 may determine interaction between non-human objects.

In an embodiment, the interaction determination model includes multiple submodels, and each submodel is learned with a preset interaction through the training set for each object as the subject of the action.

On the other hand, the semi-supervised technique does not preset the interaction to be determined for each object. The semi-supervised technique determines interaction by analyzing the interaction class (i.e., class scoring) for each set of a human action (verb) and a different object (for example, an article). Accordingly, interaction is not learned for each object as the subject of the action.

FIG. 3 is a diagram illustrating the semi-supervised interaction determination model according to an embodiment of the present disclosure.

The interaction determination model 300 of FIG. 3 is learned to determine interaction by the semi-supervised technique. Referring to FIG. 3, the interaction determination model 300 includes: an activity detection network 310 to detect activity of a specific object in an input image; and an object detection network 330 to detect an object that is different from the activity object by extracting features from the input image.

The networks 310, 330 are independent networks since their related objects are different. The scene understanding unit 17 performs the data processing process in parallel via each network 310 or 330.

The networks 310, 330 may include a convolution network that extracts features for detecting the object in the input image. As an example, the convolution network of the activity detection network 310 may be a convolution network that extracts features by the Faster R-CNN model as shown in FIG. 3. However, the activity detection network 310 is not limited thereto, and may include a variety of convolution networks for convolved feature extraction, including a convolution layer, a ReLU layer and/or a pooling layer.

As above, the networks 310, 330 have similar structures, but they differ in the features they extract. Additionally, due to the difference in features, their functions are also different.

The activity detection network 310 is configured to extract the features for determining the class corresponding to the object (according to the assumption, the human) in the video as the subject of the action, and/or the features for determining the class corresponding to the activity. The interaction determination model 300 may detect the subject of the action by determining the class of the object appearing in the video, and detect the activity of the subject of the action by determining the class of the activity appearing in the video.

In an embodiment, the features for detecting the object (for example, a human) as the subject of the action include appearance features. Additionally, the features for detecting the activity include pose features.

The interaction determination model 300 computes activity features by applying the features (for example, the appearance features and the pose features) extracted by the convolution network of the activity detection network 310 to the fully connected layer of the activity detection network 310. The activity features include activity-appearance features and activity-pose features.

The activity detection by the activity detection network 310 relies on the preset class, and the class is an activity class corresponding to activity that can be done by the subject of the action. Accordingly, when the subject of the action for the activity detection network 310 is a moving article (for example, a vehicle), the vehicle is detected and the activity of the vehicle is detected.

The object detection network 330 is configured to extract features for detecting an object that is different from the subject of the action. The class of the object detection network 330 is an object class that can be represented in the video.

The features extracted by the object detection network 330 include appearance related features.

The interaction determination model 300 computes object features by applying the features (for example, appearance features) extracted by the convolution network of the object detection network 330 to the fully connected layer of the object detection network 330.

FIG. 4 is a diagram illustrating the process of training the interaction determination model of FIG. 3.

Referring to FIG. 4, training samples including multiple videos are inputted to the networks 310, 330 of FIG. 4 as the input images for learning. Each training sample is a video including the subject of the action and a different object.

The class (i.e., the activity class) of the activity detection network 310 and the class (i.e., the object class) of the object detection network 330 are determined by the training samples. Additionally, the parameters of the activity detection network 310 and the object detection network 330 are learned in a way that minimizes the loss function of each network 310, 330.

The loss function represents a difference between the result value from the network and the actual result value. The parameter updates are generally referred to as optimization. As an example, the parameter optimization may be performed via Adaptive Moment Estimation (ADAM), but is not limited thereto, and the parameter optimization may be performed by a variety of gradient descent techniques such as Momentum, Nesterov Accelerated Gradient (NAG), Adaptive Gradient (Adagrad) and RMSProp.
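
A minimal sketch of this optimization step with the Adam optimizer, assuming each network exposes its own loss computation (the compute_loss method and the learning rate are placeholders, not APIs named in the disclosure):

```python
# Illustrative sketch: learning the parameters of the activity detection
# network and the object detection network with the Adam optimizer by
# minimizing the sum of the two loss functions over a training batch.
import torch

def make_optimizer(activity_net, object_net, lr=1e-4):
    params = list(activity_net.parameters()) + list(object_net.parameters())
    return torch.optim.Adam(params, lr=lr)

def train_step(activity_net, object_net, optimizer, batch):
    optimizer.zero_grad()
    # compute_loss is a placeholder for each network's own loss function,
    # which depends on the chosen detector architecture.
    loss = activity_net.compute_loss(batch) + object_net.compute_loss(batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```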

Referring back to FIG. 3, the interaction determination model 300 is further configured to determine the interaction based on the features computed by each network 310, 330.

In an embodiment, the interaction determination model 300 computes scores (for example, a probability value predicted to belong to the class) representing the prediction of belonging to the activity class and the object class from the activity features and the object features computed as the output results of each network 310, 330. The scores computed from the activity features are computed for each class, so that a score set whose size corresponds to the number of activity classes is obtained. Additionally, the scores computed from the object features are computed for each class, so that a score set whose size corresponds to the number of object classes is obtained.

The interaction determination model 300 links the score set for the activity and the score set for the object to generate an interaction matrix. Each element of the interaction matrix represents an interaction determination score for the interaction formed by the activity and the object corresponding to its row and column.

In FIG. 3, the interaction matrix has the probability value set for the activity along its columns and the probability value set for the object along its rows, but it will be obvious to those skilled in the art that the interaction matrix may be formed to the contrary.

In an embodiment, the interaction determination score may be an average of the scores in each score set corresponding to each row and each column, but is not limited thereto. Other suitable techniques for computing other types of representative values may be used to determine the interaction.

Describing the interaction matrix for the input image in FIG. 3, the interaction determination score of the matrix element having “ride” as the activity class and “horse” as the object class has the highest value. Thus, the scene understanding unit 17 may determine that the input image has the interaction “a human rides a horse” through the interaction determination model 300.
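
A minimal sketch of forming the interaction matrix and reading off the interaction follows, assuming NumPy; the class names and score values are illustrative assumptions.

```python
# Sketch: link the activity score set and the object score set into a matrix
# and pick the row/column pair with the highest interaction determination score.
import numpy as np

activity_classes = ["ride", "hold", "walk"]      # hypothetical activity classes
object_classes = ["horse", "bicycle", "ball"]    # hypothetical object classes

activity_scores = np.array([0.8, 0.1, 0.1])      # per-activity-class probabilities
object_scores = np.array([0.7, 0.2, 0.1])        # per-object-class probabilities

# Each element is a representative value (here, the average) of the activity
# score and the object score for the corresponding row/column pair.
interaction_matrix = (activity_scores[:, None] + object_scores[None, :]) / 2.0

a_idx, o_idx = np.unravel_index(interaction_matrix.argmax(), interaction_matrix.shape)
print(activity_classes[a_idx], object_classes[o_idx])   # -> "ride horse"
```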

In an embodiment, the interaction determination model 300 may be further configured to use the information of the source object detection process (for example, by the source object detection unit 11).

In some embodiments, the interaction determination model 300 may set the region where the features will be extracted by each network 310, 330 to determine the interaction based on the source object detection results. For example, the bounding box of the interaction determination model 300 is set based on the position of the detected source object.

In some embodiments, the interaction determination model 300 may be configured not to itself extract the features for detecting the object that is the subject of the action and a different object that is a candidate expected to have interaction with the subject of the action. As an example, the activity detection network 310 may be configured to extract only the features (for example, the pose features) for detecting the activity of the specific source object, acquired from the source object detection results, as the subject of the action. As another example, the interaction determination model 300 is configured to use the source object detection results rather than including the object detection network 330. In this case, the object detection scores computed for source object detection are used to generate the interaction matrix.

Meanwhile, the process of determining the interaction between the object and the background and the corresponding learning process are also similar to the process of determining the interaction between objects and the learning process shown in FIGS. 3 and 4. In this case, each training sample is a video including the subject of the action and the background, and the model 300 includes a network learned to determine the class of the background. In some embodiments, the network 330 may be further learned to determine the class of the background.

As above, the scene understanding unit 17 may understand the scene representing the source object of the tube by analyzing the activity of the source object of the tube.

The interaction determined by the scene understanding unit 17 may be further labeled to the source object tube associated with the corresponding interaction. When the source object tube is stored in the tube DB 40, the interaction associated with the source object is also stored in the tube DB 40.

The system 1 for generating a video synopsis receives a user query requesting video synopsis generation, and generates a video synopsis that meets the user's request based on the user query.

Referring back to FIG. 1, the system 1 includes the video synopsis generation unit 50 to generate a video synopsis based on the source object tube and the associated interaction. In an embodiment, the video synopsis generation unit 50 may generate the video synopsis in response to the user query. In some embodiments, the video synopsis generation unit 50 may include a synopsis element acquisition unit 51, a tube arrangement unit 53 and a synopsis video generation unit 55.

The system 1 acquires a video element required to generate the video synopsis based on the user query (for example, by the synopsis element acquisition unit 51). The user query includes an object (hereinafter, a “synopsis object”) that will appear in the video synopsis, or an interaction (hereinafter, a “synopsis interaction”) that will appear in the video synopsis. In some embodiments, the user query may further include the total playback time of the video synopsis, or a specific playback range of the source video required for the video synopsis.

The system 1 for generating a video synopsis may select the synopsis object and the synopsis interaction of the user query from the source video (for example, by the synopsis element acquisition unit 51).

In an embodiment, the synopsis element acquisition unit 51 may acquire a source object corresponding to the synopsis object from the source objects stored in the tube DB 40.

For example, the synopsis element acquisition unit 51 may acquire a source object corresponding to the synopsis object by searching for the source object having a class that matches a class of the synopsis object among the multiple source objects stored in the tube DB 40.

The source object corresponding to the synopsis object may be associated with multiple interactions or different interactions. To use a source object tube that accurately meets the user's request, the synopsis element acquisition unit 51 filters, from among the source objects corresponding to the synopsis object, those having an interaction corresponding to the synopsis interaction. Thus, a synopsis object having an interaction class that matches the required class is acquired.
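
A minimal sketch of this two-stage selection follows; the tube record fields and helper names are illustrative assumptions, not the units of the disclosure.

```python
# Sketch: select synopsis object tubes from the tube DB by first matching the
# object class of the user query and then filtering by the interaction class.
from dataclasses import dataclass
from typing import List

@dataclass
class SourceObjectTube:            # hypothetical record stored in the tube DB
    object_class: str              # e.g., "human"
    interactions: List[str]        # e.g., ["ride horse"]
    frames: list                   # frames representing the object's activity

def select_synopsis_tubes(tube_db: List[SourceObjectTube],
                          synopsis_object: str,
                          synopsis_interaction: str) -> List[SourceObjectTube]:
    candidates = [t for t in tube_db if t.object_class == synopsis_object]
    return [t for t in candidates if synopsis_interaction in t.interactions]
```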

The system 1 for generating a video synopsis determines the source object acquired corresponding to the user query (i.e., matching the object class and the interaction class) as the synopsis object, and uses the source object tube to generate a video synopsis. That is, the source object acquired corresponding to the user query by the system 1 for generating a video synopsis is the synopsis object selected to generate a video synopsis.

The tube arrangement unit 53 arranges the synopsis object tubes selected corresponding to the user query. The arrangement is performed by determining the start time of the selected synopsis object tubes.

In an embodiment, the tube arrangement unit 53 determines the start time of the selected synopsis object tubes in a way that minimizes collision between the selected synopsis object tubes. In the process of minimizing collision, temporal consistency is maintained to the maximum extent. The arrangement by this rule may be referred to as optimized arrangement.

FIG. 5 is a conceptual diagram of the video synopsis generated by the system for generating a video synopsis according to an embodiment of the present disclosure.

In an embodiment, the tube arrangement unit 53 shifts each event object tube (tube shifting) until a collision cost is less than a predetermined threshold Φ while maintaining temporal consistency. Here, the collision cost is defined, for every two tubes, by their temporal overlaps over all relative temporal shifts between them. The predetermined threshold Φ may take different values according to circumstances. For example, when there is a small number of event objects included in a cluster, the predetermined threshold Φ may be set to a relatively small value due to the spatio-temporal redundancy in the video synopsis. Additionally, the length L of the video synopsis is minimized.
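
The sketch below illustrates one simple reading of this tube-shifting rule; it only counts temporally overlapping frames as the collision cost and assumes a positive threshold, whereas the disclosed cost also reflects spatial overlap, so both the cost and the greedy loop are illustrative assumptions.

```python
# Sketch: shift each tube later in the synopsis until its collision cost with
# already-placed tubes falls below the threshold phi, visiting tubes in their
# chronological order to preserve temporal consistency.
from typing import List, Tuple

def collision_cost(placed: List[Tuple[int, int]], start: int, length: int) -> int:
    """Number of frames in which the candidate placement overlaps placed tubes."""
    end = start + length
    return sum(max(0, min(end, e) - max(start, s)) for s, e in placed)

def arrange_tubes(tube_lengths: List[int], phi: int) -> List[int]:
    """Return a start time per tube; phi is assumed to be a positive frame count."""
    placed: List[Tuple[int, int]] = []
    starts: List[int] = []
    for length in tube_lengths:
        start = 0
        while collision_cost(placed, start, length) >= phi:
            start += 1                       # tube shifting
        placed.append((start, start + length))
        starts.append(start)
    return starts
```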

The optimization process is performed by grouping the objects of interest temporally and spatially. Referring to FIG. 5, synopsis object tubes selected for a non-chronological video synopsis are arranged as shown in the results (A) of FIG. 5, and synopsis object tubes selected for a spatio-temporal group-based video synopsis are arranged as shown in the results (B) of FIG. 5. In contrast, synopsis object tubes of a video synopsis corresponding to the user query are arranged as shown in the results (C) of FIG. 5.

In an embodiment, when the user query includes multiple synopsis interactions, the tube arrangement unit 53 groups tubes having the same synopsis interaction to generate groups by interaction.

Thus, two or more groups by interaction are generated. The tube arrangement unit 53 arranges the groups in a way that minimizes collision between each group. In the arrangement between groups, a chronological order is not considered. This is because the user's main interest is the interaction.

The arrangement process with minimized collision between groups is similar to the above-described arrangement with minimized collision between tubes, and its detailed description is omitted herein.

Additionally, in arranging each interaction group, the tube arrangement unit 53 may further arrange the tubes included in the groups. In some embodiments, multiple synopsis object tubes included in a group having the same interaction may be arranged based on the shoot time in the source video. The reason is that the user's main interest is satisfied through grouping, and thus there is no need to ignore the chronological order.
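
A minimal sketch of grouping tubes by synopsis interaction and keeping the source-video order inside each group follows; the dictionary field names ("interaction", "start_frame") are illustrative assumptions.

```python
# Sketch: build one group per synopsis interaction, then sort the tubes in each
# group by their shoot time (here, the start frame) in the source video.
from collections import defaultdict
from typing import Dict, List

def group_and_order(tubes: List[dict], synopsis_interactions: List[str]) -> Dict[str, List[dict]]:
    """tubes: [{"interaction": "ride horse", "start_frame": 120, ...}, ...]"""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for tube in tubes:
        if tube["interaction"] in synopsis_interactions:
            groups[tube["interaction"]].append(tube)
    for interaction in groups:
        # Chronological order of the source video is kept within a group.
        groups[interaction].sort(key=lambda t: t["start_frame"])
    return groups
```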

As above, it is possible to provide an efficient video synopsis to the user who pays attention to interaction, by arranging tubes having different interactions in a distinguishable way so as to prevent collision between the groups, each of which contains tubes having the same interaction.

The synopsis video generation unit 55 generates a synopsis video based on the arranged synopsis object tube sets and the background. To this end, the synopsis video generation unit 55 selects a suitable background for the selected synopsis object. In an embodiment, the suitable background is the background of the source video including the selected synopsis object.

In some embodiments, when the user query includes information related to the source video, the background of the source video is determined as the background for the synopsis video. The background of the source video may be searched from the background DB 30 (for example, by the synopsis element acquisition unit 51 or the synopsis video generation unit 55).

Additionally, when the user query includes information related to the playback range of the source video, a change in the background over time during the playback range may be applied to the found background. In this case, the time-lapse background is used as the background of the synopsis video.

In some other embodiments, the background for generating a video synopsis is acquired based on the source object acquired as the synopsis object. For example, the background of the source video of the selected synopsis object tube is selected as the background of the synopsis video.

In another embodiment, the suitable background may be a user-defined background separately acquired as the background of the synopsis video.

The synopsis video generation unit 55 generates a synopsis video by combining the synopsis object tube with the selected background. The synopsis video generation unit 55 may stitch the synopsis object tube and the selected background by applying the position of the synopsis object to the selected background. The stitching operation may be performed through a variety of suitable stitching algorithms.
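
One simple way to realize such stitching is a masked paste of the object onto the background at its position, sketched below under the assumption of NumPy image arrays and a binary segmentation mask; the array layout is an illustrative assumption, not the disclosed algorithm.

```python
# Sketch: paste a masked object patch onto the selected background frame at the
# object's position (top, left). background: H x W x 3, mask: h x w (0/1).
import numpy as np

def stitch_object(background: np.ndarray, object_patch: np.ndarray,
                  mask: np.ndarray, top: int, left: int) -> np.ndarray:
    frame = background.copy()
    h, w = mask.shape
    region = frame[top:top + h, left:left + w]
    # Keep background pixels where the mask is 0, object pixels where it is 1.
    region[mask > 0] = object_patch[mask > 0]
    frame[top:top + h, left:left + w] = region
    return frame
```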

In some embodiments of the present disclosure, relatively accurate segmentation may be performed on the selected synopsis object, thereby minimizing a background mismatch (for example, when the source object is detected through segmentation).

The synopsis video generation unit 55 may be further configured to perform sampling. In an embodiment, when the playback time of the synopsis video is set by the user query, the synopsis video generation unit 55 may sample the synopsis video so that the video synopsis has the set playback time.
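
A minimal sketch of such sampling follows; evenly spaced frame selection and a fixed frame rate are illustrative assumptions.

```python
# Sketch: keep only as many evenly spaced frames as the requested playback
# time allows, so the synopsis video matches the set playback time.
from typing import List

def sample_to_playback_time(frames: List, target_seconds: float, fps: float = 30.0) -> List:
    target_count = int(target_seconds * fps)
    if target_count >= len(frames):
        return frames
    step = len(frames) / target_count
    return [frames[int(i * step)] for i in range(target_count)]
```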

Each unit 10, 11, 13, 15, 17, 20, 30, 40, 50, 51, 53, 55 of the system 1 for generating a video synopsis according to embodiments is not necessarily intended to indicate physically distinguishable separate components. That is, although FIG. 1 shows distinguishable separate blocks, according to embodiments, some or all of the units may be integrated in the same device (for example, a server including a database). That is, the units 10, 11, 13, 15, 17, 20, 30, 40, 50, 51, 53, 55 are functionally distinguished according to their operations in the computing device into which they are implemented, and each unit does not need to be provided independently of the others.

It will be obvious to those skilled in the art that the system 1 for generating a video synopsis may include other components not described herein. For example, the system 1 for generating a video synopsis may further include a data input device, an output device such as a display and a printer, a storage device such as memory, a transmitter/receiver to transmit and receive data via electrical communication, a network, a network interface and a protocol.

As shown in FIG. 2, the above-described embodiments include an online phase (for example, the video processing unit 10) in which a video is processed irrespective of a user query and a response phase in which a video synopsis is generated in response to a user query, but the embodiments of the present disclosure are not limited thereto.

In an embodiment, the system 1 for generating a video synopsis may include a single phase in which a video synopsis is generated by analysis of a source video in response to a user query. In this case, when receiving the user query, the system 1 for generating a video synopsis is configured to acquire a source video from the source DB 20 in which source videos are stored, and generate a video synopsis corresponding to the user query by analysis of the corresponding source video.

A method for generating a video synopsis according to an aspect of the present disclosure may be performed by a computing device including a processor. In an embodiment, the method for generating a video synopsis may be performed by part or all of the video synopsis system 1.

FIG. 6 is a flowchart of the method for generating a video synopsis according to an embodiment of the present disclosure.

Referring to FIG. 6, the method for generating a video synopsis includes: receiving a user query (S601); performing an object based analysis of a source video (S610); and generating a synopsis video in response to a video synopsis generation request from a user (S650).

In an embodiment, the object-based analysis of the source video (S610) may be performed before the step S601. A series of processes therefor is shown in the system of FIG. 1.

In another embodiment, the object-based analysis of the source video (S610) may be performed after the step S601. A series of processes therefor may be obtained by modifying the system of FIG. 1. For example, the modified system of FIG. 1 for performing the above embodiment may be configured such that the synopsis element acquisition unit 51 includes the video analysis unit 10, and the analysis operation of the video analysis unit 10 is performed in response to the user query.

FIG. 7 is a flowchart illustrating the source video analysis process according to an embodiment of the present disclosure.

Referring to FIG. 7, the step S610 includes: detecting at least one source object in a source video including the at least one source object (S611); detecting a motion of the detected source object (S613); generating a source object tube including the source object on which the motion is detected (S615); and determining interaction associated with the source object of the tube (S617).

In an embodiment, the step S611 is performed through the object detection model in the source video. Here, the object detection model is a model that is pre-learned to determine a class corresponding to an object included in an input image by extracting features for detecting the object in the input image.

The object detection model used in the step S611 is described in detail above with reference to the source object detection unit 11, and its detailed description is omitted herein.

In an embodiment, the step S610 further includes: extracting the background from the source video (S612). In some embodiments, in the step S612, the background of the source video may be extracted by detecting the object in the source video. The extracted background of the source video may be used as a background of a synopsis video. However, the background of the synopsis video is not limited thereto.

The background extraction process for performing the step S612 is described in detail above with reference to the source object detection unit 11, and its detailed description is omitted herein.

In the step S613, activity information including at least one of whether the source object is moving or not, velocity, speed, and direction is computed from the source video.
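
The activity information named above can be derived from tracked object positions in consecutive frames, as in the minimal sketch below; pixel units, the frame interval, and the motion threshold are illustrative assumptions.

```python
# Sketch: compute whether the object is moving, its velocity, speed and
# direction from the tracked object centers of two consecutive frames.
import math
from typing import Dict, Tuple

def motion_info(prev: Tuple[float, float], curr: Tuple[float, float],
                dt: float = 1 / 30.0) -> Dict[str, object]:
    """prev/curr: (x, y) object centers in consecutive frames; dt: frame interval."""
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    speed = math.hypot(dx, dy) / dt                      # scalar magnitude
    return {
        "is_moving": speed > 1.0,                        # hypothetical threshold
        "velocity": (dx / dt, dy / dt),                  # per-axis components
        "speed": speed,
        "direction": math.degrees(math.atan2(dy, dx)),   # heading in degrees
    }
```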

The frame of the source video in which the tracking information is computed is a frame representing the activity, and is used as at least part of a source object tube.

In an embodiment, the source object tube is generated based on a subset of frames representing the activity of the source object, or a combination of the subset of frames representing the activity and a subset of frames representing the source object (S615).

In some embodiments, the source object tube may be a source object tube in which at least part of the background of the source video is removed by filtering a region including the source object in the frame of the source video (S615).

The tube generation process of the step S615 is described in detail above with reference to the tube generation unit 15, and its detailed description is omitted herein.

In an embodiment, interaction associated with the source object of the tube may be determined through the interaction determination model pre-learned to determine interaction associated with the object included in the input image (S617). Here, the object under consideration as to whether it is associated with interaction is the source object of the tube.

In some embodiments, the interaction determination model may be a model configured to receive an image of a size including a first object as the input image and extract a first feature, receive an image of a size including a region including the first object and a different region as the input image and extract a second feature, and determine interaction associated with the first object based on the first feature and the second feature. Here, the different region includes a region including a second object that is different from the first object in the source video, or the background.
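
A minimal sketch of this two-crop variant follows, assuming PyTorch; the shared backbone, feature sizes, and the concatenation-based classifier are illustrative assumptions rather than the disclosed model.

```python
# Sketch: extract a first feature from a crop around the first object and a
# second feature from a crop spanning the first object and the different
# region, then classify the pair into interaction classes.
import torch
import torch.nn as nn

class TwoCropInteractionModel(nn.Module):
    def __init__(self, num_interactions: int, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(2 * feat_dim, num_interactions)

    def forward(self, object_crop, union_crop):
        first_feature = self.backbone(object_crop)    # crop around the first object
        second_feature = self.backbone(union_crop)    # crop covering both regions
        return self.classifier(torch.cat([first_feature, second_feature], dim=1))
```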

In some other embodiments, the interaction determination model may include: an activity detection network to detect the activity of the specific object in the input image; and an object detection network to detect an object that is different from the activity object by extracting features from the input image. Here, the activity detection network is configured to extract the features for determining the class of the activity appearing in the video, and the object detection network is configured to extract the features for determining the class of the object appearing in the video.

The interaction determination model of the step S617 is described in detail above with reference to FIGS. 2 and 3, and its detailed description is omitted herein.

In an embodiment, the source object tube may be labeled with at least one of source object related information or source video related information as a result of the detection of the source object, and with the interaction determined to be associated with the source object of the tube.

FIG. 8 is a diagram illustrating the synopsis video generation process according to an embodiment of the present disclosure.

Referring to FIG. 8, the step S650 includes: selecting a tube corresponding to a user query (S651); arranging the selected object tubes (i.e., tubes of the source object selected as the synopsis object) (S653); and generating a synopsis video based on the arranged tubes and the background (S655).

In the step S651, the tube corresponding to the user query is a tube of the source object that corresponds to the synopsis object included in the user query and is associated with the interaction corresponding to the synopsis interaction included in the user query.

In an embodiment, the tube corresponding to the user query is selected by: determining the source object corresponding to the synopsis object from among the detected source objects, and filtering the source object associated with the interaction corresponding to the synopsis interaction from among the determined source objects (S651).

In the step S653, the tubes selected in the step S651 are arranged by determining the start time of the selected synopsis object tubes with minimized collision between the selected synopsis object tubes.

In an embodiment, the step of arranging the selected tubes may include: when the user query includes multiple synopsis interactions, grouping tubes having the same synopsis interaction to generate groups by interaction; and determining the start time of each group for arrangement with minimized collision between each group.

In some embodiments, the step of arranging the selected tubes may further include: determining the start time of the selected tubes by minimizing collision between the synopsis object tubes in the same group.

In the step S655, a synopsis video is generated based on the arranged synopsis object tubes and the background.

In an embodiment, the step S655 may include: stitching the source object tubes, in which at least part of the background of the source video is removed, with the background of the source video.

In an embodiment, the step S655 may further include: sampling the playback speed of the synopsis video.

The steps S651 to S655 are described in detail above with reference to the video synopsis generation unit 50, and their detailed description is omitted herein.

As above, the method for generating a video synopsis and the system 1 therefor may detect the source object by object-based analysis of the source video. Additionally, the motion of the source object may be detected. Additionally, the tubes used as the synopsis object tubes may be selected to generate a video synopsis based on the source object, the motion, and the interaction associated with the source object.

As a result, it is possible to provide a customized video synopsis that better satisfies the user's needs.

The operation of the method for generating a video synopsis and the system 1 therefor according to the embodiments as described above may be, at least in part, implemented in a computer program and recorded in a computer-readable recording medium. For example, it may be implemented with a program product on the computer-readable medium including program code, and may be executed by the processor for performing any or all of the above-described steps, operations or processes.

The computer may be a computing device such as a desktop computer, a laptop computer, a notebook computer, a smartphone or the like, and may be any integrated device. The computer is a device having at least one general-purpose or specialized processor, memory, storage, and networking components (either wireless or wired). The computer may run an operating system (OS) such as, for example, an OS that is compatible with Microsoft Windows, Apple OS X or iOS, a Linux distribution, or Google Android OS.

The computer-readable recording medium includes all types of recording devices in which computer-readable data is stored. Examples of the computer-readable recording medium include read only memory (ROM), random access memory (RAM), compact disc read only memory (CD-ROM), magnetic tape, floppy disk, and optical data storage devices. Additionally, the computer-readable recording medium may be distributed over computer systems connected via a network, and may store and execute the computer-readable code in a distributed manner. Additionally, a functional program, code and a code segment for realizing this embodiment will be easily understood by persons having ordinary skill in the technical field to which this embodiment belongs.

While the present disclosure has been hereinabove described with reference to the embodiments shown in the drawings, this is provided by way of illustration and those skilled in the art will understand that various modifications and variations may be made thereto. However, it should be understood that such modifications fall within the scope of technical protection of the present disclosure. Accordingly, the true technical protection scope of the present disclosure should be defined by the technical spirit of the appended claims.

Recently, there have been technological advances in video surveillance systems, such as intelligent CCTV systems, and their range of applications is ever increasing. The system for generating a video synopsis according to an aspect of the present disclosure may generate a video synopsis reflecting a specific interaction that the user desires to see, by analyzing the interaction of the source video through the interaction determination model based on machine learning, one of the technologies of the fourth industrial revolution, thereby providing maximum efficiency and convenience through minimal information, and it is expected to have easy access to the corresponding market and a great ripple effect.

What is claimed is:
 1. A system for generating a video synopsis, comprising: a source object detection unit configured to detect at least one source object in a source video including at least one object; a motion detection unit configured to detect one or more motion of the source object in the source video; a tube generation unit configured to generate one or more source object tube including the source object on which the motion is detected; a scene understanding unit configured to determine interaction associated with the source object of the tube; and a video synopsis generation unit configured to generate one or more video synopsis based on the source object tube associated with the determined interaction.
 2. The system for generating a video synopsis according to claim 1, wherein the source object detection unit detects the source object through an object detection model, and the object detection model is pre-learned to extract one or more feature for detecting an object from an input image and determine a class corresponding to the object included in the input image.
 3. The system for generating a video synopsis according to claim 2, wherein the object detection model is configured to detect the source object by extracting the feature for detecting the object from the input image and determining the class to which each pixel belongs.
 4. The system for generating a video synopsis according to claim 3, wherein the object detection model includes: a first submodel to determine a region of interest (ROI) by detecting a position of the object in the input image; and a second submodel to mask the object included in the ROI.
 5. The system for generating a video synopsis according to claim 1, wherein the source object detection unit is further configured to extract a background from the source video.
 6. The system for generating a video synopsis according to claim 5, wherein the source object detection unit is configured to extract the background through a background detection model that determines at least one class regarded as the background for each pixel.
 7. The system for generating a video synopsis according to claim 5, wherein the source object detection unit extracts the background by cutting a region occupied by the source object in the source video.
 8. The system for generating a video synopsis according to claim 5, further comprising: a background database (DB) to store the extracted background of the source video.
 9. The system for generating a video synopsis according to claim 1, wherein the motion detection unit is configured to compute tracking information of the source object by tracking a specific source object in a subset of frames in which the specific source object is detected.
 10. The system for generating a video synopsis according to claim 9, wherein the tracking information of the source object includes at least one of whether moving or not, a velocity, a speed, and a direction.
 11. The system for generating a video synopsis according to claim 1, wherein the tube generation unit is configured to generate the source object tube based on a subset of frames representing an activity of the source object, or a combination of the subset of frames representing the activity of the source object and a subset of frames representing the source object.
 12. The system for generating a video synopsis according to claim 11, wherein the tube generation unit is further configured to: filter a region including the source object in the frame of the source video, and generate the source object tube in which at least part of the background of the source video is removed.
 13. The system for generating a video synopsis according to claim 12, wherein the tube generation unit is further configured to: when an image region including the source object is extracted as a result of the detection of the source object, generate the source object tube using the extracted image region instead of the filtering.
 14. The system for generating a video synopsis according to claim 1, wherein the scene understanding unit is configured to determine the interaction associated with the source object of the tube through an interaction determination model pre-learned to determine interaction associated with an object included in an input image, and the interaction determination model includes a convolution network.
 15. The system for generating a video synopsis according to claim 14, wherein the interaction determination model is learned to preset an associable interaction class with the object, wherein the said object is a subject of an action that triggers the interaction.
 16. The system for generating a video synopsis according to claim 15, wherein the interaction determination model is further configured to: receive an image having a size including a first object as the input image and extract a first feature, receive an image having a size including a region including the first object and a different region as the input image and extract a second feature, and determine interaction associated with the first object based on the first feature and the second feature.
 17. The system for generating a video synopsis according to claim 16, wherein the different region includes a region including a second object that is different from the first object, or a background.
 18. The system for generating a video synopsis according to claim 16, wherein the first feature is a feature extracted to detect the source object.
 19. The system for generating a video synopsis according to claim 14, wherein the interaction determination model is configured to determine a class of the interaction by detecting an activity of a specific object which is a subject of an action triggering the interaction and detecting a different element associated with the interaction.
 20. The system for generating a video synopsis according to claim 19, wherein the interaction determination model includes: an activity detection network to detect the activity of the specific object in the input image; and an object detection network to detect an object that is different from the activity object by extracting a feature from the input image, the activity detection network is configured to extract the feature for determining a class of the activity appearing in the video, and the object detection network is configured to extract the feature for determining a class of the object appearing in the video.
 21. The system for generating a video synopsis according to claim 20, wherein the feature for determining the class of the activity includes a pose feature, and the feature for determining the class of the object includes an appearance feature.
 22. The system for generating a video synopsis according to claim 20, wherein the interaction determination model is further configured to: link a set of values computed by the activity detection network and a set of values computed by the object detection network to generate an interaction matrix, and determine the interaction associated with the specific object in the input image based on the activity and the object corresponding to a row and a column of an element having a highest value among elements of the interaction matrix.
 23. The system for generating a video synopsis according to claim 1, wherein the tube generation unit is further configured to label the source object tube with at least one of source object related information or source video related information as a result of the detection of the source object, and interaction determined to be associated with the source object of the tube.
 24. The system for generating a video synopsis according to claim 23, further comprising: a source DB to store at least one of the source object tube and the labeled data.
 25. The system for generating a video synopsis according to claim 1, wherein the video synopsis generation unit is configured to generate the video synopsis in response to a user query including a synopsis object and a synopsis interaction to be required for synopsis.
 26. The system for generating a video synopsis according to claim 25, wherein the video synopsis generation unit is further configured to: determine the source object corresponding to the synopsis object in the detected source object, and select the source object for generating the video synopsis by filtering the source object associated with the interaction corresponding to the synopsis interaction in the selected source object.
 27. The system for generating a video synopsis according to claim 25, wherein the video synopsis generation unit is further configured to: select the source object associated with the interaction corresponding to the user query as a tube for synopsis, and arrange the selected tubes; and generate the video synopsis based on the selected tubes and a background.
 28. The system for generating a video synopsis according to claim 27, wherein the video synopsis generation unit is further configured to determine a start time of selected synopsis object tubes with minimized collision between the selected synopsis object tubes.
 29. The system for generating a video synopsis according to claim 25, wherein the video synopsis generation unit is further configured to: group tubes having a same synopsis interaction to generate a group for each interaction when the user query includes multiple synopsis interactions.
 30. The system for generating a video synopsis according to claim 29, wherein the video synopsis generation unit is further configured to: determine start times of each group, wherein the start times minimize collision between each group, and arrange the plurality of groups based on each start time of each group.
 31. The system for generating a video synopsis according to claim 30, wherein, for arranging the plurality of groups, the video synopsis generation unit is configured to determine start times of the selected synopsis object tubes in a same group based on a shoot time in the source video.
 32. The system for generating a video synopsis according to claim 27, wherein, for generating the video synopsis based on the selected tubes and a background, the video synopsis generation unit is configured to stitch the source object tubes in which at least part of the background of the source video is removed with the background of the source video.